We study Bayesian discriminative inference given a model family $p(c, x, \theta)$
that is assumed to contain all our prior information but still known to be incorrect.
This falls in between "standard" Bayesian generative modeling and Bayesian
regression, where the margin $p(x, \theta)$ is known to be uninformative about
$p(c \mid x, \theta)$. We give an axiomatic proof that discriminative posterior is
consistent for conditional inference; using the discriminative posterior is standard
practice in classical Bayesian regression, but we show that it is theoretically justified
for model families of joint densities as well. A practical benefit compared to Bayesian
regression is that the standard methods of handling missing values in generative
modeling can be extended into discriminative inference, which is useful if the amount
of data is small. Compared to standard generative modeling, discriminative posterior
results in better conditional inference if the model family is incorrect. If the model
family contains also the true model, the discriminative posterior gives the same result
as standard Bayesian generative modeling. Practical computation is done with
Markov chain Monte Carlo.

1 Introduction 

Our aim is Bayesian discriminative inference in the case where the model family
$p(c, x, \theta)$ is known to be incorrect. Here x is a data vector and c its class, and the
$\theta$ are parameters of the model family. By discriminative we mean predicting the
conditional distribution $p(c \mid x)$.

The Bayesian approach of using the posterior of the generative model family
$p(c, x, \theta)$ has not been shown to be justified in this case, and it is known that it
does not always generalize well to new data (in case of point estimates, see for example
[1, 2, 3]; in this paper we provide a toy example that illustrates the fact for posterior
distributions). Therefore alternative approaches such as Bayesian regression are
applied [4]. It can be argued that the best solution is to improve the model family by
incorporating more prior knowledge. This is not always possible or feasible, however,
and simplified models are generally used, often with good results. For example,
it is often practical to use mixture models even if it is known a priori that the data
cannot be faithfully described by them (see for example [5]). There are good reasons for
still applying Bayesian-style techniques [6], but the general problem of how to best do
inference with incorrect model families is still open.

In practice, the usual method for discriminative tasks is Bayesian regression. It dis- 
regards all assumptions about the distribution of x, and considers x only as covariates 
of the model for c. Bayesian regression may give superior results in discriminative 
inference, but the omission of a generative model for x (although it may be readily 
available) makes it difficult to handle missing values in the data. Numerous heuristic 
methods for imputing missing values have been suggested, see for example [7], but no 
theoretical arguments of their optimality have been presented. Here we assume that 
we are given a generative model family of the full data (x, c), and therefore have a 
generative mechanism readily available for imputing missing values. 

From the generative modeling perspective, Bayesian regression ignores any infor- 
mation about c supplied by the marginal distribution of x. This is justified if (i) the 
covariates are explicitly chosen when designing the experimental setting and hence are 
not noisy, or (ii) there is a separate set of parameters for generating x on the one hand 
and c given x on the other, and the sets are assumed to be independent in their prior 
distribution. In the latter case the posterior factors out into two parts, and the parame- 
ters used for generating x are neither needed nor useful in the regression task. See for 
instance [4, 8] for more details. However, there has been no theoretical justification for 
Bayesian regression in the more general setting where the independence does not hold. 

For point estimates of generative models it is well known that maximizing the joint 
likelihood and the conditional likelihood give in general different results. Maximum 
conditional likelihood gives asymptotically a better estimate of the conditional like- 
lihood [2], and it can be optimized with expectation-maximization-type procedures 
[9, 10]. In this paper we extend that line of work to show that the two different 
approaches, joint and conditional modeling, result in different posterior distributions 
which are asymptotically equal only if the true model is within the model family. We 
give an axiomatic justification to the discriminative posterior, and demonstrate empiri- 
cally that it works as expected. If there are no covariates, the discriminative posterior is 
the same as the standard posterior of joint density modeling, that is, ordinary Bayesian 
inference. 

To our knowledge the extension from point estimates to a posterior distribution is 
new. We are aware of only one suggestion, the so-called supervised posterior [11], 
which also has empirical support in the sense of maximum a posteriori estimates [12]. 
The posterior has, however, only been justified heuristically. 

For the purpose of regression, the discriminative posterior makes it possible to
use more general model structures than standard Bayesian regression; in essence any
generative model family can be used. Besides giving a general justification
to Bayesian regression-type modeling, the discriminative posterior should give better
predictions with a generative model family $p(c, x, \theta)$ if the whole model is (at least
approximately) correct. The additional benefit is that the use of the full generative
model gives a principled way of handling missing values. The gained advantage, compared
to using the standard non-discriminative posterior, is that the predictions should be
more accurate assuming the model family is incorrect.

In this paper, we present the necessary background and definitions in Section 2.
The discriminative posterior is derived briefly from a set of five axioms in Section 3;
the full proof is included as an appendix. There is a close resemblance to Cox's axioms,
and standard Bayesian inference can indeed be derived also from this set. However,
the new axioms allow also inference in the case where the model manifold is known to
be inadequate. In Section 4 we show that the discriminative posterior can be extended in a
standard manner [7] to handle data missing at random. In Section 5 we present some
experimental evidence that the discriminative posterior behaves as expected.

2 Aims and Definitions 

In this paper, we prove the following two claims; the claims follow from Theorem 3.1
(discriminative posterior), which is the main result of this paper.

Well-known Given a discriminative model, that is, a model $p(c \mid x; \theta)$ for the
conditional density, Bayesian regression results in consistent conditional inference.

New Given a joint density model $p(c, x \mid \theta)$, discriminative posterior results in
consistent conditional inference.

In accordance with [13], we call inference consistent if the utility is maximized with
large data sets. This paper proves both of the above claims. Notice that although the
first claim is well known, it has not been proven, aside from the special case where the
priors for the margin x and $c \mid x$ are independent, as discussed in the introduction and
in [4].

2.1 Setup 

Throughout the paper, observations are denoted by $(c, x)$, and assumed to be i.i.d. We
use $\bar\Theta$ to denote the set of all possible models that could generate the observations.
Models that are applicable in practice are restricted to a lower dimensional manifold
$\Theta$ of models, $\Theta \subseteq \bar\Theta$. In other words, the subspace $\Theta$ defines our model
family, in this work denoted by a distribution $p(c, x, \theta)$ parameterized by
$\theta \in \Theta$.

There exists a model in $\bar\Theta$ which describes the "true" model, which has actually
generated the observations and is typically unknown. With slight abuse of notation
we denote this model by $\tilde\theta \in \bar\Theta$, with the understanding that it may be outside
our parametric model family. In fact, in practice no probabilistic model is perfectly
true; every model is false to some extent, that is, the data has usually not been
generated by a model in our model family, $\tilde\theta \notin \Theta$.



The distribution induced in the model parameter space $\Theta$ after observing the data
$D = \{(c_i, x_i)\}_{i=1}^{n}$ is referred to as a posterior. By standard posterior we mean the
posterior obtained from Bayes' formula using a full joint density model. In this paper
we discuss the discriminative posterior, which is obtained from axioms 1-5 below.

2.2 Utility Function and Point Estimates 

In this subsection, we introduce the research problem by recapitulating the known dif- 
ference between point estimates of joint and conditional likelihood. We present the 
point estimates in terms of Kullback-Leibler divergences, in a form that allows gener- 
alizing from point estimates to the posterior distribution in section 3. 

Bayesian inference can be derived in a decision theoretic framework as maximiza- 
tion of the expected utility of the decision maker [14]. In general, the choice of the 
utility function is subjective. However, several arguments for using log-probability as 
utility function can be made, see for example [14]. 

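When inspecting a full generative model at the limit where the amount of data is
infinite, the joint posterior distribution
$p_j(\theta \mid D) \propto p(\theta) \prod_{(c, x) \in D} p(c, x \mid \theta)$ becomes a
point solution, $p_j(\theta \mid D) = \delta(\theta - \hat\theta_{\mathrm{JOINT}})$.^1 An accurate
approximation of the log-likelihood is produced by a utility function minimizing the
approximation error $K_{\mathrm{JOINT}}$ between the point estimate $\hat\theta_{\mathrm{JOINT}}$ and the
true model $\tilde\theta$ as follows:

$$\hat\theta_{\mathrm{JOINT}} = \arg\min_{\theta \in \Theta} K_{\mathrm{JOINT}}(\tilde\theta, \theta), \quad \text{where} \quad K_{\mathrm{JOINT}}(\tilde\theta, \theta) = \sum_c \int p(c, x \mid \tilde\theta) \log \frac{p(c, x \mid \tilde\theta)}{p(c, x \mid \theta)} \, dx . \tag{1}$$

If the true model is in the model family, that is, $\tilde\theta \in \Theta$, equation (1) can be
minimized to zero and the resulting point estimate is effectively the MAP solution. If
$\tilde\theta \notin \Theta$ the resulting point estimate is the best estimate of the true joint
distribution $p(c, x \mid \tilde\theta)$ with respect to $K_{\mathrm{JOINT}}$.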

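However, the joint estimate may not be optimal if we are interested in approximating
some other quantity than the likelihood. Consider the problem of finding the best
point estimate $\hat\theta_{\mathrm{COND}}$ for the conditional distribution $p(c \mid x, \tilde\theta)$. The
average KL-divergence between the true conditional distribution at $\tilde\theta$ and its
estimate at $\theta$ is given by

$$K_{\mathrm{COND}}(\tilde\theta, \theta) = \int p(x \mid \tilde\theta) \sum_c p(c \mid x, \tilde\theta) \log \frac{p(c \mid x, \tilde\theta)}{p(c \mid x, \theta)} \, dx , \tag{2}$$

and the best point estimate with respect to $K_{\mathrm{COND}}$ is

$$\hat\theta_{\mathrm{COND}} = \arg\min_{\theta \in \Theta} K_{\mathrm{COND}}(\tilde\theta, \theta) . \tag{3}$$

By equations (1) and (2) we may write

$$K_{\mathrm{JOINT}}(\tilde\theta, \theta) = K_{\mathrm{COND}}(\tilde\theta, \theta) + \int p(x \mid \tilde\theta) \log \frac{p(x \mid \tilde\theta)}{p(x \mid \theta)} \, dx . \tag{4}$$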



^1 Strictly speaking, the posterior can also have multiple modes; we will not treat these
special cases here, but they do not restrict the generality.



Therefore the point estimates $\hat\theta_{\mathrm{JOINT}}$ and $\hat\theta_{\mathrm{COND}}$ are different in
general. If the model that has generated the data does not belong to the model family,
that is, $\tilde\theta \notin \Theta$, then by Equation (4) the joint estimate is generally worse than
the conditional estimate in conditional inference, measured in terms of conditional
likelihood. See also [2].
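The decomposition in Equation (4) can be checked numerically on a small discrete example. The sketch below is only an illustration: the two probability tables (`p_true` standing in for the true model and `p_model` for a member of the model family) are made-up values, and the integral over x is replaced by a sum because x is taken to be discrete here.

```python
import math

# Hypothetical joint tables over c in {0, 1} and x in {0, 1, 2}; keys are (c, x).
p_true = {(0, 0): 0.30, (0, 1): 0.15, (0, 2): 0.05,
          (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.30}
p_model = {(0, 0): 0.20, (0, 1): 0.20, (0, 2): 0.10,
           (1, 0): 0.10, (1, 1): 0.10, (1, 2): 0.30}


def marginal_x(p):
    """Marginal distribution of x obtained by summing the joint over c."""
    m = {}
    for (_, x), v in p.items():
        m[x] = m.get(x, 0.0) + v
    return m


def k_joint(p, q):
    """Discrete analogue of Equation (1): KL divergence between joints."""
    return sum(v * math.log(v / q[key]) for key, v in p.items())


def k_cond(p, q):
    """Discrete analogue of Equation (2): average conditional KL divergence."""
    px, qx = marginal_x(p), marginal_x(q)
    return sum(v * math.log((v / px[x]) / (q[(c, x)] / qx[x]))
               for (c, x), v in p.items())


def k_marginal(p, q):
    """Last term of Equation (4): KL divergence between the margins of x."""
    px, qx = marginal_x(p), marginal_x(q)
    return sum(v * math.log(v / qx[x]) for x, v in px.items())


gap = k_joint(p_true, p_model) - (k_cond(p_true, p_model) + k_marginal(p_true, p_model))
print(f"K_JOINT - (K_COND + K_margin) = {gap:.2e}")
```

Because the margin term is a KL divergence and hence non-negative, minimizing $K_{\mathrm{JOINT}}$ may trade conditional accuracy for a better fit of the margin of x, which is the effect discussed above.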

2.3 Discriminative versus generative models 

A discriminative model does not make any assumptions on the distribution of the
margin of x. That is, it does not incorporate a generative model for x, and can be
interpreted to rather use the empirical distribution of x as its margin [15].

A generative model, on the other hand, assumes a specific parametric form for the
full distribution $p(c, x, \theta)$. A generative model can be learned either as a joint density
model or in a discriminative manner. Our point in this paper is that the selection
corresponds to choosing the utility function; this is actually our fifth axiom in Section 3. In
joint density modeling the utility is to model the full distribution $p(c, x)$ as accurately
as possible, which corresponds to computing the standard posterior incorporating the
likelihood function. In discriminative learning the utility is to model the conditional
distribution $p(c \mid x)$, and the result is a discriminative posterior incorporating the
conditional likelihood. A generative model optimized in a discriminative manner is
referred to as a discriminative joint density model in the following.

For generative models and in case of point estimates, if the model family is correct, 
a maximum likelihood (ML) solution is better for predicting p(c|x) than maximum 
conditional likelihood (CML). They have the same maximum, but the asymptotic vari- 
ance of CML estimate is higher [16]. However, in case of incorrect models, a maximum 
conditional likelihood estimate is better than maximum likelihood [2]. For an example 
illustrating that CML can be better than ML in predicting p(c|x), see [17]. 

Since the discriminative joint density model has a more restricted model structure
than Bayesian regression, we expect it to perform better with small amounts of data.
More formally, later in Theorem 3.1 we move from the point estimate of Equation (3) to a
discriminative posterior distribution $p_d(\theta \mid D)$ over the model parameters $\theta$ in
$\Theta$. Since the posterior is normalized to unity, $\int_\Theta p_d(\theta \mid D) \, d\theta = 1$, the values
of the posterior $p_d(\theta \mid D)$ are generally smaller for larger model families; the posterior
is more diffuse. Equation (2) can be generalized to the expectation of the approximation
error,

$$E_{p_d(\theta \mid D)}\left[ K_{\mathrm{COND}}(\tilde\theta, \theta) \right] = - \int \sum_c p(x, c \mid \tilde\theta) \, p_d(\theta \mid D) \log p(c \mid x, \theta) \, dx \, d\theta + \mathrm{const.} \tag{5}$$

The expected approximation error is small when both the discriminative posterior
distribution $p_d(\theta \mid D)$ and the conditional likelihood $p(c \mid x, \theta)$ are large at the same
time. For small amounts of data, if the model family is too large, the values of the
posterior $p_d(\theta \mid D)$ are small. The discriminative joint density model has a more restricted
model family than that of the Bayesian regression, and hence the values of the
posterior are larger. If the model is approximately correct, the discriminative joint density
model will have a smaller approximation error than the Bayesian regression, that is,
$p(c \mid x, \theta)$ is large somewhere in $\Theta$. This is analogous to selecting the model family
that maximizes the evidence (in our case the expected conditional log-likelihood) in
Bayesian inference; choosing a model family that is too complex leads to small
evidence (see, e.g., [18]). The difference from traditional Bayesian inference is that we
do not require that the true data generating distribution $\tilde\theta$ is contained in the
parameter space $\Theta$ under consideration.

3 Axiomatic Derivation of Discriminative Posterior 

In this section we generalize the point estimate $\hat\theta_{\mathrm{COND}}$ presented in Section 2.2
to a discriminative posterior distribution over $\theta \in \Theta$.

Theorem 3.1 (Discriminative posterior distribution) It follows from axioms 1-6 listed
below that, given data $D = \{(c_i, x_i)\}_{i=1}^{n}$, the discriminative posterior distribution
$p_d(\theta \mid D)$ is of the form

$$p_d(\theta \mid D) \propto p(\theta) \prod_{(c, x) \in D} p(c \mid x, \theta) .$$

The predictive distribution for a new x, obtained by integrating over this posterior,
$p(c \mid x, D) = \int p_d(\theta \mid D) \, p(c \mid x, \theta) \, d\theta$, is consistent for conditional
inference. That is, $p_d$ is consistent for the utility of conditional likelihood.

The discriminative posterior follows from requiring the following axioms to hold:

1. The posterior $p_d(\theta \mid D)$ can be represented by non-negative real numbers that
satisfy $\int_\Theta p_d(\theta \mid D) \, d\theta = 1$.

2. A model $\theta \in \bar\Theta$ can be represented as a function $h((c, x), \theta)$ that maps the
observations $(c, x)$ to real numbers.

3. The posterior, after observing a data set D followed by an observation $(c, x)$, is
given by $p_d(\theta \mid D \cup (c, x)) = F(h((c, x), \theta), p_d(\theta \mid D))$, where F is a twice
differentiable function in both of its parameters.

4. Exchangeability: The value of the posterior is independent of the ordering of the
observations. That is, the posterior after two observations $(c, x)$ and $(c', x')$ is
the same irrespective of their ordering:

$$F(h((c', x'), \theta), p_d(\theta \mid (c, x) \cup D)) = F(h((c, x), \theta), p_d(\theta \mid (c', x') \cup D)) .$$

5. The posterior must agree with the utility. For $\tilde\theta \in \bar\Theta$ and
$\theta_1, \theta_2 \in \Theta$, the following condition is satisfied:

$$p_d(\theta_1 \mid D_{\tilde\theta}) \le p_d(\theta_2 \mid D_{\tilde\theta}) \iff K_{\mathrm{COND}}(\tilde\theta, \theta_1) \ge K_{\mathrm{COND}}(\tilde\theta, \theta_2) ,$$

where $D_{\tilde\theta}$ is a very large data set sampled from $p(c, x \mid \tilde\theta)$. We further assume
that the discriminative posteriors $p_d$ at $\theta_1$ and $\theta_2$ are equal only if the
corresponding conditional KL-divergences $K_{\mathrm{COND}}$ are equal.



The first axiom above is simply a requirement that the posterior is a probability
distribution in the parameter space.

The second axiom defines a model in general terms; we define it as a mapping from
event space into a real number.

The third axiom makes smoothness assumptions on the posterior. The reason for 
the axiom is technical; the smoothness is used in the proofs. Cox [19] makes similar 
assumptions, and our proof therefore holds in the same scope as Cox's; see [20]. 

The fourth axiom requires exchangeability; the shape of the posterior distribution 
should not depend on the order in which the observations are made. This deviates 
slightly from analogous earlier proofs for standard Bayesian inference, which have 
rather used the requirement of associativity [19] or included it as an additional con- 
straint in modeling after presenting the axioms [14]. 

The fifth axiom states, in essence, that asymptotically (at the limit of a large but
finite data set) the shape of the posterior is such that the posterior is always smaller if
the "distance" $K_{\mathrm{COND}}(\tilde\theta, \theta)$ is larger. If the opposite were true, the integral
over the discriminative posterior would give larger weight to solutions further away from
the true model, leading to a larger error measured by $K_{\mathrm{COND}}(\tilde\theta, \theta)$.

Axioms 1-5 are sufficient to fix the discriminative posterior, up to a monotonic 
transformation. To fix the monotonic transformation we introduce the sixth axiom: 

6. For fixed x the model reduces to the standard posterior. For the data set
$D_x = \{(c, x') \in D \mid x' = x\}$, the discriminative posterior $p_d(\theta \mid D_x)$ matches
the standard posterior of the model

$$p_x(c \mid \theta) \equiv p(c \mid x, \theta) .$$

We use $p(\theta) = p_d(\theta \mid \emptyset)$ to denote the posterior when no data is observed (prior
distribution).

Proof The proof is in the Appendix. For clarity of presentation, we additionally sketch
the proof in the following.

Proposition 3.2 (F is isomorphic to multiplication) It follows from axiom 4 that the
function F is of the form

$$f(F(h((c, x), \theta), p_d(\theta \mid D))) \propto h((c, x), \theta) \, f(p_d(\theta \mid D)) , \tag{6}$$

where f is a monotonic function which we, by convention, fix to the identity function.^2

The problem then reduces to finding a functional form for $h((c, x), \theta)$. Utilizing
both the equality part and the inequality part of axiom 5, the following proposition can
be derived.

Proposition 3.3 It follows from axiom 5 that

$$h((c, x), \theta) \propto p(c \mid x, \theta)^{A} \quad \text{where } A > 0 . \tag{7}$$

Finally, axiom 6 effectively states that we decide to follow the Bayesian convention
for a fixed x, that is, to set $A = 1$.
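To make the form of the posterior in Theorem 3.1 concrete, the following sketch evaluates both the standard and the discriminative posterior on a parameter grid. Everything specific here is an assumption made for illustration and is not taken from the paper: a one-parameter model with equal class priors and unit-variance Gaussian class-conditional densities centred at $-\theta$ and $+\theta$, a flat prior over the grid, and synthetic data.

```python
import math
import random


def log_joint(c, x, theta):
    """log p(c, x | theta) for the assumed toy model: p(c) = 1/2, x | c ~ N(+-theta, 1)."""
    mu = theta if c == 1 else -theta
    return math.log(0.5) - 0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mu) ** 2


def log_cond(c, x, theta):
    """log p(c | x, theta) = log p(c, x | theta) - log sum_c' p(c', x | theta)."""
    a, b = log_joint(0, x, theta), log_joint(1, x, theta)
    m = max(a, b)
    log_px = m + math.log(math.exp(a - m) + math.exp(b - m))
    return log_joint(c, x, theta) - log_px


def grid_posterior(data, grid, log_lik):
    """Normalized posterior weights on a grid, assuming a flat prior over the grid."""
    logp = [sum(log_lik(c, x, t) for c, x in data) for t in grid]
    m = max(logp)
    w = [math.exp(v - m) for v in logp]
    z = sum(w)
    return [v / z for v in w]


rng = random.Random(0)
data = []
for _ in range(50):
    c = rng.randint(0, 1)
    data.append((c, rng.gauss(1.0 if c == 1 else -1.0, 1.0)))

grid = [i / 20 for i in range(1, 61)]                # theta in (0, 3]
post_joint = grid_posterior(data, grid, log_joint)   # standard posterior
post_disc = grid_posterior(data, grid, log_cond)     # discriminative posterior (A = 1)
```

The only difference between the two calls is the likelihood term, which mirrors the statement of the theorem: the discriminative posterior multiplies the prior by conditional likelihoods rather than joint likelihoods.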



^2 We follow here Cox [19] and subsequent work, see e.g. [20]. A difference which does not affect the
need for the convention is that usually multiplicativity is derived based on the assumption of associativity,
not exchangeability. Our setup is slightly different from that of Cox, since instead of updating beliefs on
events, we here consider updating beliefs in a family of models.



4 Modeling Missing Data 



New Discriminative posterior gives a theoretically justified way of handling missing 
values in discriminative tasks. 

Discriminative models cannot readily handle components missing from the data vector,
since the data is used only as covariates. However, standard methods of handling
missing data with generative models [7] can be applied with the discriminative posterior.

The additional assumption we need to make is a model for which data are missing. 
Below we derive the formulas for the common case of data missing independently at 
random. Extensions incorporating prior information of the process by which the data 
are missing are straightforward, although possibly not trivial. 

Write the observations $x = (x_1, x_2)$. Assume that $x_1$ can be missing and denote a
missing observation by $x_1 = \emptyset$. The task is still to predict c by $p(c \mid x_1, x_2)$.

Since we are given a model for the joint distribution which is assumed to be
approximately correct, it will be used to model the missing data. We denote this by
$q(c, x_1, x_2 \mid \theta')$, $x_1 \neq \emptyset$, with a prior $q(\theta')$, $\theta' \in \Theta'$. We further denote
the parameters of the missing data mechanism by $\lambda$. Now, similar to joint density
modeling, if the priors for the joint model, $q(\theta')$, and the missing-data mechanism,
$g(\lambda)$, are independent, the missing data mechanism is ignorable [7]. In other words, the
posterior that takes the missing data into account can be written as

$$p_d(\theta \mid D) \propto q(\theta') \, g(\lambda) \prod_{y \in D_{\mathrm{full}}} q(c \mid x_1, x_2, \theta') \prod_{y \in D_{\mathrm{missing}}} q(c \mid x_2, \theta') , \tag{8}$$

where $y = (c, x_1, x_2)$ and we have used $D_{\mathrm{full}}$ and $D_{\mathrm{missing}}$ to denote the portions
of the data set with $x_1 \neq \emptyset$ and $x_1 = \emptyset$, respectively.

Equation (8) has been obtained by using q to construct a model family in which
the data is missing independently at random with probability $\lambda$, having a prior $g(\lambda)$.
That is, we define a model family that generates the missing data in addition to the
non-missing data,

$$p(c, x_1, x_2 \mid \theta) \equiv \begin{cases} (1 - \lambda) \, q(c, x_1, x_2 \mid \theta') , & x_1 \neq \emptyset \\ \lambda \int_{\mathrm{supp}(x_1)} q(c, x, x_2 \mid \theta') \, dx = \lambda \, q(c, x_2 \mid \theta') , & x_1 = \emptyset , \end{cases}$$

where $\theta \in \Theta$ and $\mathrm{supp}(x_1)$ denotes the support of $x_1$. The parameter space $\Theta$ is
spanned by $\theta' \in \Theta'$ and $\lambda$. Equation (8) follows directly by applying Theorem 3.1
to $p(c \mid x_1, x_2, \theta)$.^3 Notice that the posterior for $\theta'$ in (8) is independent of $\lambda$. The
division to $x_1$ and $x_2$ can be made separately for each data item. The items can have
different numbers of missing components, each component k having a probability $\lambda_k$
of being missing.
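A minimal sketch of how the $\theta'$-dependent part of Equation (8) could be evaluated is given below. The model is an assumption chosen for brevity, not the model of the paper: equal class priors and independent unit-variance Gaussian features given the class, so that marginalizing a missing $x_1$ out of $q(c, x_1, x_2 \mid \theta')$ amounts to dropping its factor. A missing $x_1$ is encoded as `None`, and the factor $g(\lambda)$ is omitted because the posterior for $\theta'$ does not depend on $\lambda$.

```python
import math


def log_norm(x, mu, sigma=1.0):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2


def log_q_joint(c, x1, x2, theta):
    """log q(c, x1, x2 | theta'), or log q(c, x2 | theta') when x1 is None (missing).

    theta maps each class label to (mean of x1, mean of x2); class priors are equal.
    """
    mu1, mu2 = theta[c]
    lp = -math.log(len(theta)) + log_norm(x2, mu2)
    if x1 is not None:
        lp += log_norm(x1, mu1)
    return lp


def log_q_cond(c, x1, x2, theta):
    """log q(c | x1, x2, theta') for complete items, log q(c | x2, theta') otherwise."""
    logs = [log_q_joint(k, x1, x2, theta) for k in theta]
    m = max(logs)
    log_norm_const = m + math.log(sum(math.exp(v - m) for v in logs))
    return log_q_joint(c, x1, x2, theta) - log_norm_const


def log_disc_posterior(data, theta, log_prior=lambda th: 0.0):
    """Unnormalized log of Equation (8) as a function of theta' (flat prior by default)."""
    return log_prior(theta) + sum(log_q_cond(c, x1, x2, theta) for c, x1, x2 in data)


theta = {0: (0.0, 0.0), 1: (2.0, 1.0)}
data = [(0, 0.1, -0.3), (1, None, 1.2), (1, 2.4, 0.8), (0, None, 0.2)]
value = log_disc_posterior(data, theta)
```

Items with and without $x_1$ enter the same product; only the conditioning set of the class probability changes, as in Equation (8).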



^3 Notice that $p(c \mid x_1, x_2, \theta) = q(c \mid x_1, x_2, \theta')$ when $x_1 \neq \emptyset$, and
$p(c \mid x_1 = \emptyset, x_2, \theta) = q(c \mid x_2, \theta')$.



5 Experiments 

5.1 Implementation of Discriminative Sampling 

The discriminative posterior can be sampled with an ordinary Metropolis-Hastings
algorithm where the standard posterior
$p(\theta \mid D) \propto \prod_{i=1}^{n} p(c_i, x_i \mid \theta) \, p(\theta)$ is simply
replaced by the discriminative version
$p_d(\theta \mid D) \propto \prod_{i=1}^{n} p(c_i \mid x_i, \theta) \, p(\theta)$, where

$$p(c_i \mid x_i, \theta) = \frac{p(c_i, x_i \mid \theta)}{p(x_i \mid \theta)} .$$

In MCMC sampling of the discriminative posterior, the normalization term of the
conditional likelihood poses problems, since it involves a marginalization over the class
variable c and latent variables z, that is,
$p(x_i \mid \theta) = \sum_c \int_{\mathrm{supp}(z)} p(x_i, z, c \mid \theta) \, dz$. In
case of discrete latent variables, such as in mixture models, the marginalization
reduces to simple summations and can be computed exactly and efficiently. If the model
contains continuous latent variables the integral needs to be evaluated numerically.
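The sampling scheme described above can be sketched as a random-walk Metropolis-Hastings sampler with a Gaussian jump kernel. The code below is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the one-parameter toy model (equal class priors, $x \mid c \sim N(\pm\theta, 1)$), the zero-mean Gaussian prior with standard deviation `prior_sd`, the step size, and the chain settings are all hypothetical choices.

```python
import math
import random


def log_cond(c, x, theta):
    """log p(c | x, theta) for the assumed toy model.

    With equal class priors and x | c ~ N(+theta or -theta, 1), the conditional
    is p(c = 1 | x, theta) = 1 / (1 + exp(-2 * theta * x)).
    """
    z = 2.0 * theta * x if c == 1 else -2.0 * theta * x
    # Numerically stable log-sigmoid.
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def log_disc_post(theta, data, prior_sd=10.0):
    """Unnormalized log discriminative posterior: log prior + sum of log p(c | x, theta)."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return log_prior + sum(log_cond(c, x, theta) for c, x in data)


def metropolis_hastings(log_target, init, step, n_iter, burn_in, thin, rng):
    """Random-walk Metropolis-Hastings with a Gaussian jump kernel of width `step`."""
    theta, lp = init, log_target(init)
    samples, accepted = [], 0
    for it in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_target(prop)
        # Accept with probability min(1, target(prop) / target(theta)).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
            accepted += 1
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append(theta)
    return samples, accepted / n_iter


rng = random.Random(1)
data = []
for _ in range(100):
    c = rng.randint(0, 1)
    data.append((c, rng.gauss(1.0 if c == 1 else -1.0, 1.0)))

samples, acc_rate = metropolis_hastings(
    lambda t: log_disc_post(t, data), init=0.0, step=0.3,
    n_iter=3000, burn_in=500, thin=5, rng=rng)
```

Sampling the standard posterior with the same routine would only require passing a target built from the joint log-likelihood instead of `log_cond`. In this toy model the normalization $p(x_i \mid \theta)$ is a two-term sum, which corresponds to the discrete case mentioned above.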

5.2 Performance as a function of the incorrectness of the model 

We first set up a toy example where the distance between the model family and the true 
model is varied. 

5.2.1 Background 

In this experiment we compare the performance of logistic regression and a mixture 
model. Logistic regression was chosen since many of the discriminative models can be 
seen as extensions of it, for example conditional random fields [8]. In case of logistic 
regression, the following theorem exists: 

Theorem 5.1 [21]

For a k-class classification problem with equal priors, the pairwise log-odds ratio of
the class posteriors is affine if and only if the class conditional distributions belong to
any fixed exponential family.

That is, the model family of logistic regression incorporates the model family of any
generative exponential (mixture) model.^4 A direct interpretation is that any generative
exponential family mixture model defines a smaller subspace within the parametric
space of the logistic regression model.
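As a worked instance of the "if" direction, consider two classes with equal priors and univariate Gaussian class-conditional densities that share a standard deviation $\sigma$ (this special case is chosen here for illustration and is not part of the theorem statement). The log-odds of the class posteriors is then affine in x:

```latex
\log \frac{p(c = 1 \mid x)}{p(c = 2 \mid x)}
  = \log \frac{\mathcal{N}(x \mid \mu_1, \sigma^2)}{\mathcal{N}(x \mid \mu_2, \sigma^2)}
  = -\frac{(x - \mu_1)^2}{2\sigma^2} + \frac{(x - \mu_2)^2}{2\sigma^2}
  = \frac{\mu_1 - \mu_2}{\sigma^2} \, x - \frac{\mu_1^2 - \mu_2^2}{2\sigma^2} .
```

The quadratic terms in x cancel because the variance is shared, leaving a slope $(\mu_1 - \mu_2)/\sigma^2$ and an intercept $-(\mu_1^2 - \mu_2^2)/(2\sigma^2)$, which is exactly the functional form of logistic regression.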

Since the model family of logistic regression is larger, it is asymptotically better, 
but if the number of data is small, generative models can be better due to the restricted 
parameter space; see for example [3] for a comparison of naive Bayes and logistic re- 
gression. This happens if the particular model family is at least approximately correct. 



5.2.2 Experiment 

The true model generates ten-dimensional data x from the two-component mixture
$p(x) = \sum_{j=1}^{2} \frac{1}{2} \, p(x \mid \mu_j, \sigma_j)$, where j indexes the mixture component, and
$p(x \mid \mu_j, \sigma_j)$ is a Gaussian with mean $\mu_j$ and standard deviation $\sigma_j$. The data is
labeled according to the generating mixture component (i.e., the "class" variable)
$j \in \{1, 2\}$. Two of the ten dimensions contain information about the class. The "true"
parameters, used to generate the data, on these dimensions are $\mu_1 = 5$, $\sigma_1 = 2$, and
$\mu_2 = 9$, $\sigma_2 = 2$. The remaining eight dimensions are Gaussian noise with $\mu = 9$,
$\sigma = 2$ for both components.

^4 Banerjee [21] provides a proof also for conditional random fields.

The "incorrect" generative model used for inference is a mixture of two Gaussians
where the variances are constrained to be $\sigma_j = k \cdot \mu_j + 2$. With increasing k the model
family thus draws further away from the true model. We assume for both of the $\mu_j$ a
Gaussian prior $p(\mu_j \mid m, s)$ having the same fixed hyperparameters^5 $m = 7$, $s = 7$ for
all dimensions.
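The data-generating process and the constraint of the incorrect model family described above can be sketched as follows. Some details are assumptions made only for this illustration: the text does not say which two of the ten dimensions are informative (the first two are used here), and the function names are hypothetical.

```python
import random

TRUE_MEANS = {1: 5.0, 2: 9.0}   # class-dependent mean on the informative dimensions
NOISE_MEAN = 9.0                # mean of the eight noise dimensions, both classes
SIGMA = 2.0                     # standard deviation used on every dimension


def sample_toy_data(n, rng, n_dims=10, n_informative=2):
    """Draw n labeled points (j, x) from the two-component mixture with equal weights.

    Assumption: the informative dimensions are the first `n_informative` ones.
    """
    data = []
    for _ in range(n):
        j = rng.choice((1, 2))
        x = [rng.gauss(TRUE_MEANS[j], SIGMA) for _ in range(n_informative)]
        x += [rng.gauss(NOISE_MEAN, SIGMA) for _ in range(n_dims - n_informative)]
        data.append((j, x))
    return data


def constrained_sigma(mu_j, k):
    """Constraint of the incorrect model family: sigma_j = k * mu_j + 2.

    The text calls sigma_j a standard deviation in the true model and a variance
    in the constraint; this helper only evaluates the stated formula.
    """
    return k * mu_j + 2.0


rng = random.Random(0)
train = sample_toy_data(32, rng)
test = sample_toy_data(1000, rng)
```

With `k = 0` the constrained value equals 2 for both components, and it grows linearly with the component mean as `k` increases, which is how the model family is moved away from the true model.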

Data sets were generated from the true model with 10000 test data points and a
varying number of training data points ($N_D \in \{32, 64, 128, 256, 512, 1024\}$). Both
the discriminative and standard posteriors were sampled for the model. The goodness
of the models was evaluated by perplexity of conditional likelihood on the test data set.
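The evaluation measure can be sketched as below. The paper does not spell out its formula in this section, so the code assumes the common definition of perplexity as the exponential of the negative mean log predictive probability, with the predictive distribution approximated by averaging $p(c \mid x, \theta)$ over posterior samples. The callable `cond_prob` and the argument names are hypothetical.

```python
import math


def predictive_prob(c, x, thetas, cond_prob):
    """Monte Carlo estimate of p(c | x, D): average of p(c | x, theta) over samples."""
    return sum(cond_prob(c, x, th) for th in thetas) / len(thetas)


def conditional_perplexity(test_data, thetas, cond_prob):
    """exp(-(1/N) * sum_i log p(c_i | x_i, D)) over the test set (assumed definition)."""
    total = sum(math.log(predictive_prob(c, x, thetas, cond_prob))
                for c, x in test_data)
    return math.exp(-total / len(test_data))


# Usage with a stand-in predictor that always assigns probability 1/2 to each class.
toy_test = [(1, [0.0]), (2, [1.0]), (1, [2.0])]
toy_samples = [0.1, 0.2, 0.3]
ppl = conditional_perplexity(toy_test, toy_samples, lambda c, x, th: 0.5)
```

Under this definition a predictor that is uniform over two classes has perplexity 2, and a perfect predictor has perplexity 1, so lower values are better.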

Standard and discriminative posteriors of the incorrect generative model are
compared against Bayesian logistic regression. A uniform prior for the parameters $\beta$ of
the logistic regression model was assumed. As mentioned in subsection 5.2.1 above,
the Gaussian naive Bayes and logistic regression are connected (see for example [22] 
for exact mapping between the parameters). The parameter space of the logistic model 
incorporates the true model as well as the incorrect model family, and is therefore the 
optimal discriminative model for this toy data. However, since the parameter space of 
the logistic regression model is larger, the model is expected to need more data samples 
for good predictions. 

The models perform as expected (Figure 1). The model family of our incorrect 
model was chosen such that it contains useful prior knowledge about the distribution 
of x. Compared with logistic regression, the model family becomes more restricted,
which is beneficial for small learning data sets (see Figure 1a). Compared to joint
density sampling, discriminative sampling results in better predictions of the class c 
when the model family is incorrect. 

The models were compared more quantitatively by repeating the sampling ten times
for a fixed value of $k = 2$ for each of the learning data set sizes; in every repeat a
new data set was also generated. The results in Figure 2 confirm the qualitative findings of
Figure 1.

The posterior was sampled with the Metropolis-Hastings algorithm with a Gaussian
jump kernel. In case of joint density and discriminative sampling, three chains were
sampled with a burn-in of 500 iterations each, after which every fifth sample was
collected. Bayesian regression required more samples for convergence, so a burn-in of
5500 samples was used. Convergence was estimated when carrying out experiments
for Figure 2; the length of sampling chains was set such that the confidence intervals
for each of the models were roughly the same. The total number of samples was 900 per
chain. The width of the jump kernel was chosen as a linear function of data such that the
acceptance rate was between 0.2 and 0.4 [4] as the amount of data was increased. Selection
was carried out by preliminary tests with different random seeds.



^5 The model is thus slightly incorrect even with $k = 0$.

Missing Data. The experiment with toy data is continued in a setting where 50% of
the learning data are missing at random.

MCMC sampling was carried out as described above. For the logistic regression
model, a multiple imputation scheme is applied, as recommended in [4]. In order to
have sampling conditions comparable to sampling from a generative model, each
sampling chain used one imputed data set. Imputation was carried out with the generative
model (representing our current best knowledge, which, however, differs from the true
joint model).

As can be seen from Figure 1, discriminative sampling is better than joint density
sampling when the model is incorrect. The performance of Bayesian regression seems
to be affected heavily by the incorrect generative model used to generate missing data.
As can be seen from Figure 2, Bayesian regression is, surprisingly, even worse than
the standard posterior. The performance could be increased by imputing more than one
data set, at the cost of additional computational complexity, however.



[Figure 1, rows a) and b); panels: Joint density MCMC, Discriminative MCMC, Bayesian regression]

Figure 1: A comparison of joint density sampling, discriminative sampling, and
Bayesian regression. a) Full data, b) 50% of the data missing. Grid points where the
method is better than the others are marked with black. The X-axis denotes the amount
of data, and the Y-axis the deviance of the model family from the "true" model (i.e.,
the value of k). The methods are prone to sampling error, but the following general
conclusions can be made: Bayesian generative modeling ("Joint density MCMC") is
best when the model family is approximately correct. Discriminative posterior
("Discriminative MCMC") is better when the model is incorrect and the learning data set
is small. As the amount of data is increased, Bayesian regression and discriminative
posterior show roughly equal performance (see also Figure 2).



Figure 2: A comparison of joint density sampling (dotted line), discriminative
sampling (solid line), and Bayesian regression (dashed line) with an incorrect model.
X-axis: learning data set size, Y-axis: perplexity. Also the 95% confidence intervals
are plotted (with the thin lines). Logistic regression performs significantly worse than
discriminative posterior with small data set sizes, whereas with large data sets the
performance is roughly equal. Joint density modeling is consistently worse. The model
was fixed to $k = 2$; ten individual runs were carried out in order to compute 95%
confidence intervals. a) Full data, b) 50% of the data missing.

5.3 Document Modeling

As a demonstration in a practical domain we applied the discriminative posterior to
document classification. We used the Reuters data set [23], of which we selected a
subset of 1100 documents from four categories (CCAT, ECAT, GCAT, MCAT). Each
selected document was classified to exactly one of the four classes. As a preprocessing
stage, we chose the 25 most informative words within the training set of 100 documents,
having the highest mutual information between classes and words [24]. The
remaining 1000 documents were used as a test set.

We first applied a mixture of unigrams model (MUM, see Figure 3 left) [25]. The 
model assumes that each document is generated from a mixture of M hidden "topics,"
$p(x_i \mid \theta) = \sum_{j=1}^{M} \pi(j)\, p(x_i \mid \beta_j)$, where j is the index of the topic, and $\beta_j$ the
multinomial parameters that generate words from the topic. The vector $x_i$ contains the
observed word counts (with a total of $N_W$) for document i, and $\pi(j)$ is the probability
of generating words from the topic j. The usual approach [4] for modeling paired
data $\{(x_i, c_i)\}_i$ by a joint density mixture model was applied; c was associated with
the label of the mixture component from which the data is assumed to be generated.
Dirichlet priors were used for the $\beta$ and $\pi$, with hyperparameters set to 25 and 1,
respectively. We used the simplest form of MUM containing one topic vector per class. The
sampler used Metropolis-Hastings with a Gaussian jump kernel. The kernel width was 
chosen such that the acceptance rate was roughly 0.2 [4]. 
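
The sampling scheme can be illustrated on a much smaller problem. The sketch below is not the MUM sampler of the experiments: it assumes a hypothetical toy joint model (two classes with equal prior probability, one-dimensional unit-variance Gaussian class conditionals, and a wide Gaussian prior on the two means) and runs random-walk Metropolis-Hastings with a Gaussian jump kernel on the discriminative posterior, which is proportional to $p(\theta) \prod_i p(c_i \mid x_i, \theta)$. The defaults for burn-in and thinning are illustrative.

```python
import numpy as np

def log_joint(theta, x, c):
    """log p(c_i, x_i | theta) per data point; theta = (mu_0, mu_1) (toy model)."""
    mu = theta[c]
    return np.log(0.5) - 0.5 * np.log(2.0 * np.pi) - 0.5 * (x - mu) ** 2

def log_conditional(theta, x, c):
    """log p(c_i | x_i, theta), obtained from the joint model by Bayes' rule."""
    per_class = [log_joint(theta, x, np.full_like(c, k)) for k in (0, 1)]
    return log_joint(theta, x, c) - np.logaddexp(per_class[0], per_class[1])

def log_prior(theta, scale=10.0):
    """Unnormalised log density of an assumed N(0, scale^2) prior on each mean."""
    return -0.5 * np.sum((theta / scale) ** 2)

def metropolis_hastings(log_target, theta0, n_samples, width, rng, burn_in=100, thin=10):
    """Random-walk Metropolis-Hastings with a Gaussian jump kernel of given width."""
    theta = np.asarray(theta0, dtype=float)
    current = log_target(theta)
    samples, accepted = [], 0
    total = burn_in + thin * n_samples
    for it in range(total):
        proposal = theta + width * rng.standard_normal(theta.shape)
        candidate = log_target(proposal)
        if np.log(rng.uniform()) < candidate - current:
            theta, current, accepted = proposal, candidate, accepted + 1
        if it >= burn_in and (it - burn_in) % thin == thin - 1:
            samples.append(theta.copy())
    return np.array(samples), accepted / total

# Synthetic data for the toy model (illustrative values only).
rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=50)
x = rng.normal(loc=np.where(c == 0, -1.0, 1.0))

def discriminative_target(theta):
    return log_prior(theta) + log_conditional(theta, x, c).sum()

samples, rate = metropolis_hastings(discriminative_target, [0.0, 0.0], 100, 0.5, rng)
```

Replacing `log_conditional` by `log_joint` in the target would give the ordinary joint posterior sampler. In practice `width` would be adjusted until `rate` is close to the acceptance rate reported above.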

As an example of a model with continuous hidden variables, we implemented the
Latent Dirichlet Allocation (LDA) or discrete PCA model [26, 27]. We constructed
a variant of the model that also generates the classes, shown in Figure 3 (right). The
topology of the model is a mixture of LDAs; the generating mixture component $z_c$
is first sampled from $\pi_c$. The component indexes a row in the matrix $\alpha$, and (for
simplicity) contains a direct mapping to c. Now, given $\alpha$, the generative model for
words is an ordinary LDA. The $\pi$ is a topic distribution drawn individually for each
document d from a Dirichlet with parameters $\alpha(z_c)$, that is, Dirichlet($\alpha(z_c)$). Each
word $w_{nd}$ belongs to one topic $z_{nd}$, which is picked from Multinomial($\pi$). The word is
then generated from Multinomial($\beta(z_{nd}, \cdot)$), where $z_{nd}$ indexes a row in the matrix $\beta$.
We assume a Dirichlet prior for the $\beta$, with hyperparameters set to 2, a Dirichlet prior for
the $\alpha$, with hyperparameters set to 1, and a Dirichlet prior for $\pi_c$ with hyperparameters
equal to 50. The parameter values were set in initial test runs (with a separate data set
from the Reuters corpus). Four topic vectors were assumed, making the model structure
similar to [28]. The difference from MUM is that in LDA-type models a document can be
generated from several topics.

Sampling was carried out using Metropolis-Hastings with a Gaussian jump kernel, 
where the kernel width was chosen such that the acceptance rate was roughly 0.2 [4]. 
The necessary integrals were computed with Monte Carlo integration. The convergence 
of integration was monitored with a jackknife estimate of standard error [29]; sampling 
was ended when the estimate was less than 5 % of the value of the integral. The length 
of burn-in was 100 iterations, after which every tenth sample was picked. The total 
number of collected samples was 100. The probabilities were clipped to the range 





Figure 3: Left: Mixture of Unigrams. Right: Mixture of Latent Dirichlet Allocation 
models. 



5.3.1 Results 

Discriminative sampling is better for both models (Table 1). 

Table 1: Comparison of sampling from the ordinary posterior (jMCMC) and discriminative
posterior (dMCMC) for two model families: Mixture of Unigrams (MUM) and
mixture of Latent Dirichlet Allocation models (mLDA). The figures are perplexities of
the 1000 document test set; there were 100 learning data points. Small perplexity is
better; random guessing gives perplexity of 4.

Model    dMCMC    jMCMC    Conditional ML
MUM      2.56     3.98     4.84
mLDA     2.36     3.92     3.14
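
For reference, the perplexity figures can be read with the following minimal sketch. It assumes the usual definition of conditional perplexity, the exponential of the negative mean log-probability assigned to the true classes; the function name `perplexity` is hypothetical.

```python
import numpy as np

def perplexity(probs, labels):
    """Conditional perplexity of class predictions.

    probs: (n_docs, n_classes) predicted p(c | x); labels: (n_docs,) true classes.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(np.exp(-np.mean(np.log(picked))))
```

Under this definition a uniform prediction over four classes gives a perplexity of exactly 4, which agrees with the random-guessing baseline quoted in the caption of Table 1.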
6 Discussion

We have introduced a principled way of making conditional inference with a discrimi- 
native posterior. Compared to standard joint posterior density, discriminative posterior 
results in better inference when the model family is incorrect, which is usually the case. 
Compared to purely discriminative modeling, discriminative posterior is better in case 
of small data sets if the model family is at least approximately correct. Additionally, we 
have introduced a justified method for incorporating missing data into discriminative 
modeling. 

Joint density modeling, discriminative joint density modeling, and Bayesian regression
can be seen as making different assumptions on the margin model $p(x \mid \theta)$. Joint
density modeling assumes the model family to be correct, and hence also the model of
the x margin to be correct. If this assumption holds, the discriminative posterior and joint
density modeling will asymptotically give the same result. On the other hand, if the
assumption does not hold, the discriminative joint density modeling will asymptotically
give better or at least as good results. Discriminative joint density modeling assumes
that the margin model $p(x \mid \theta)$ may be incorrect, but the conditional model $p(c \mid x, \theta)$,
derived from the joint model that includes the model for the margin, is in itself at least
approximately correct. Then inference is best made with the discriminative posterior, as in
this paper. Finally, if the model family is completely incorrect, or if there is lots of
data, a larger, discriminative model family and Bayesian regression should be used.

Another approach to the same problem was suggested in [30], where the traditional
generative view to discriminative modeling has been extended by complementing the
conditional model for $p(c \mid x, \theta)$ with a model for $p(x \mid \theta')$, to form the joint density
model $p(c, x \mid \theta, \theta') = p(c \mid x, \theta)\, p(x \mid \theta')$. That is, a larger model family is postulated
with additional parameters $\theta'$ for modeling the marginal x. The conditional density
model $p(c \mid x, \theta)$ is derived by Bayes rule from a formula for the joint density,
$p(c, x \mid \theta)$, and the model for the marginal $p(x \mid \theta')$ is obtained by marginalizing it.

This conceptualization is very useful for semisupervised learning; the dependency 
between 9 and 9' can be tuned by choosing a suitable prior, which allows balancing 
between discriminative and generative modeling. The optimum for semisupervised 
learning is found in between the two extremes. The approach of [30] contains our
discriminative posterior distribution as a special case in the limit where the priors are
independent, that is, $p(\theta, \theta') = p(\theta)\, p(\theta')$, where the parameters $\theta$ and $\theta'$ can be treated
independently.

Also [30] can be viewed as giving a theoretical justification for Bayesian discrim- 
inative learning, based on generative modeling. The work introduces a method of ex- 
tending a fully discriminative model into a generative model, making discriminative 
learning a special case of optimizing the likelihood (that is, the case where priors sepa- 
rate). Our work starts from different assumptions. We assume that the utility functions 
can be different, depending on the goal of the modeler. As we show in this paper, the 
requirement of our axiom 5 that the utility should agree with the posterior will eventu- 
ally lead us to the proper form of the posterior. Here we chose conditional inference as 
the utility, obtaining a discriminative posterior. If the utility had been joint modeling of 
(c, x), we would have obtained the standard posterior (the case of no covariates). If the 
given model is incorrect the different utilities lead to different learning schemes. From 
this point of view, the approach of [30] is principled only if the "true" model belongs
to the postulated larger model family $p(c, x \mid \theta, \theta')$.

As a practical matter, efficient implementation of sampling from the discriminative 
posterior will require more work. The sampling schemes applied in this paper are sim- 
ple and computationally intensive; there exist several more advanced methods which 
should be tested. 
A Proofs

We use the notation r = (c, x) and denote the set of all possible observations r by R. 

For purposes of some parts of the proof, we assume that the set of possible observations R is finite. This 
assumption is not excessively restrictive, since any well-behaving infinite set of observations and respective 
probabilistic models can be approximated with an arbitrary accuracy by discretizing R to sufficiently many 
bins and converting the probabilistic models to the corresponding multinomial distributions over R. 

Example: Assume the observations r are real numbers in the compact interval [a, b] and they are modelled
by a well-behaving probability density p(r), that is, the set of possible observations R is an infinite set.
We can approximate the distribution p(r) by partitioning the interval [a, b] into N bins
$I(i) = [a + (i - 1)(b - a)/N,\ a + i(b - a)/N]$ of width $(b - a)/N$ each, where $i \in \{1, \ldots, N\}$, and assigning each bin
a multinomial probability $\theta_i = \int_{I(i)} p(r)\, dr$. One possible choice for the family of models p(r) would be
the Gaussian distributions parametrized by the mean and variance; the parameter space of these Gaussian
distributions would span a 2-dimensional subspace $\Theta$ in the $N - 1$ dimensional parameter space $\bar{\Theta}$ of the
multinomial distributions.
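
The discretization in the example can be sketched numerically. The function names below are hypothetical, and the final renormalisation is an added assumption: a Gaussian places some mass outside [a, b], so the bin probabilities are rescaled to sum to one.

```python
import math
import numpy as np

def gaussian_cdf(r, mean, std):
    """Cumulative distribution function of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((r - mean) / (std * math.sqrt(2.0))))

def discretize_gaussian(mean, std, a, b, n_bins):
    """Multinomial bin probabilities theta_i for N equal-width bins on [a, b]."""
    edges = np.linspace(a, b, n_bins + 1)
    cdf = np.array([gaussian_cdf(e, mean, std) for e in edges])
    mass = np.diff(cdf)        # integral of p(r) over each bin I(i)
    return mass / mass.sum()   # renormalise the mass truncated to [a, b]

theta = discretize_gaussian(mean=0.0, std=1.0, a=-4.0, b=4.0, n_bins=16)
```

Varying `mean` and `std` traces out the two-dimensional family inside the (N - 1)-dimensional simplex of all multinomial distributions over the bins.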
A.1 From exchangeability $F(h(c', x'), p_d(\theta \mid (c, x))) = F(h(c, x), p_d(\theta \mid (c', x')))$
it follows that the posterior is homomorphic to multiplicativity

4. Exchangeability: The value of the posterior is independent of the ordering of the
observations. That is, the posterior after two observations (c, x) and (c', x') is
the same irrespective of their ordering:
$F(h((c', x'), \theta), p_d(\theta \mid (c, x) \cup D)) = F(h((c, x), \theta), p_d(\theta \mid (c', x') \cup D))$.

Proof For simplicity let us denote $x = h(c', x')$, $y = h(c, x)$, and $z = p(\theta)$. The
exchangeability axiom thus reduces to the problem of finding a function F such that

$$F(x, F(y, z)) = F(y, F(x, z)) , \tag{10}$$

where $F(y, z) = p_d(\theta \mid (c, x))$. By denoting $F(x, z) = u$ and $F(y, z) = v$, the
equation becomes $F(x, v) = F(y, u)$.

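We begin by assuming that function F is differentiable in both its arguments (in a
similar manner to Cox). Differentiating with respect to z, y and x in turn, and writing
$F_1(p, q)$ for $\partial F(p, q)/\partial p$ and $F_2(p, q)$ for $\partial F(p, q)/\partial q$, we obtain

$$F_2(x, v) \frac{\partial v}{\partial z} = F_2(y, u) \frac{\partial u}{\partial z} \tag{11}$$

$$F_2(x, v) \frac{\partial v}{\partial y} = F_1(y, u) \tag{12}$$

$$F_1(x, v) = F_2(y, u) \frac{\partial u}{\partial x} . \tag{13}$$

Differentiating equation (11) wrt. z, x and y in turn, we get

$$F_{22}(x, v) \left( \frac{\partial v}{\partial z} \right)^2 + F_2(x, v) \frac{\partial^2 v}{\partial z^2} = F_{22}(y, u) \left( \frac{\partial u}{\partial z} \right)^2 + F_2(y, u) \frac{\partial^2 u}{\partial z^2} \tag{14}$$

$$F_{12}(x, v) \frac{\partial v}{\partial z} = F_{22}(y, u) \frac{\partial u}{\partial x} \frac{\partial u}{\partial z} + F_2(y, u) \frac{\partial^2 u}{\partial z \partial x} \tag{15}$$

$$F_{22}(x, v) \frac{\partial v}{\partial y} \frac{\partial v}{\partial z} + F_2(x, v) \frac{\partial^2 v}{\partial z \partial y} = F_{12}(y, u) \frac{\partial u}{\partial z} \tag{16}$$

and differentiating equation (12) wrt. x we get

$$F_{12}(x, v) \frac{\partial v}{\partial y} = F_{12}(y, u) \frac{\partial u}{\partial x} . \tag{17}$$

By solving $F_{12}(x, v)$ from equation (15) and $F_{12}(y, u)$ from equation (16) and inserting
into equation (17), we get

$$\frac{F_{22}(y, u) \frac{\partial u}{\partial x} \frac{\partial u}{\partial z} + F_2(y, u) \frac{\partial^2 u}{\partial z \partial x}}{\frac{\partial v}{\partial z}} \frac{\partial v}{\partial y} = \frac{F_{22}(x, v) \frac{\partial v}{\partial y} \frac{\partial v}{\partial z} + F_2(x, v) \frac{\partial^2 v}{\partial z \partial y}}{\frac{\partial u}{\partial z}} \frac{\partial u}{\partial x}$$

$$\Updownarrow$$

$$F_{22}(y, u) \left( \frac{\partial u}{\partial z} \right)^2 - F_{22}(x, v) \left( \frac{\partial v}{\partial z} \right)^2 = F_2(x, v) \frac{\frac{\partial^2 v}{\partial z \partial y} \frac{\partial v}{\partial z}}{\frac{\partial v}{\partial y}} - F_2(y, u) \frac{\frac{\partial^2 u}{\partial z \partial x} \frac{\partial u}{\partial z}}{\frac{\partial u}{\partial x}} . \tag{18}$$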
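By inserting equation (18) into equation (14), we get

$$\frac{F_2(x, v)}{F_2(y, u)} \left( \frac{\partial^2 v}{\partial z^2} \frac{\partial u}{\partial x} \frac{\partial v}{\partial y} - \frac{\partial^2 v}{\partial z \partial y} \frac{\partial u}{\partial x} \frac{\partial v}{\partial z} \right) = \frac{\partial^2 u}{\partial z^2} \frac{\partial u}{\partial x} \frac{\partial v}{\partial y} - \frac{\partial^2 u}{\partial z \partial x} \frac{\partial u}{\partial z} \frac{\partial v}{\partial y} . \tag{19}$$

Notice that by equation (11) we can write $F_2(x, v)/F_2(y, u) = \frac{\partial u}{\partial z} / \frac{\partial v}{\partial z}$. Inserting this into
equation (19), and dividing by $\frac{\partial u}{\partial x} \frac{\partial v}{\partial y} \frac{\partial u}{\partial z}$, the equation simplifies to

$$\frac{\partial^2 v / \partial z^2}{\partial v / \partial z} - \frac{\partial^2 v / \partial z \partial y}{\partial v / \partial y} = \frac{\partial^2 u / \partial z^2}{\partial u / \partial z} - \frac{\partial^2 u / \partial z \partial x}{\partial u / \partial x} .$$

The equation can be written also as

$$\frac{\partial}{\partial z} \ln \frac{\partial v}{\partial z} - \frac{\partial}{\partial z} \ln \frac{\partial v}{\partial y} = \frac{\partial}{\partial z} \ln \frac{\partial u}{\partial z} - \frac{\partial}{\partial z} \ln \frac{\partial u}{\partial x} . \tag{20}$$

Since the left hand side depends on y, z and right hand side depends on x, z, it follows
that both must be functions of only z. Furthermore, since the derivative of the logarithm
of a function is a function of z, the function itself must be of the form

$$\frac{\partial v / \partial z}{\partial v / \partial y} = \frac{F_2(y, z)}{F_1(y, z)} = \frac{\Phi_1(y)}{\Phi_2(z)} . \tag{21}$$

On the other hand, dividing equation (12) by equation (13), we get

$$\frac{F_2(x, v)}{F_1(x, v)} \frac{\partial v}{\partial y} = \left( \frac{F_2(y, u)}{F_1(y, u)} \frac{\partial u}{\partial x} \right)^{-1} . \tag{22}$$

By equation (21), $F_2(p, q)/F_1(p, q) = \Phi_1(p)/\Phi_2(q)$. Inserting this, we get

$$\frac{\Phi_1(x)}{\Phi_2(v)} \frac{\partial v}{\partial y} = \left( \frac{\Phi_1(y)}{\Phi_2(u)} \frac{\partial u}{\partial x} \right)^{-1}$$

$$\Updownarrow$$

$$\frac{\Phi_1(y)}{\Phi_2(v)} \frac{\partial v}{\partial y} = \left( \frac{\Phi_1(x)}{\Phi_2(u)} \frac{\partial u}{\partial x} \right)^{-1} . \tag{23}$$

Now, since left hand side depends only on (y, z) and right hand side on (x, z), each
must be a function of z only, that is g(z). Furthermore, we note that we must have

$$g(z) = \frac{1}{g(z)} \;\Rightarrow\; g(z) = \pm 1 . \tag{24}$$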
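Since the condition must be fulfilled for all x, y, we must have

$$\frac{\Phi_1(y)}{\Phi_2(v)} \frac{\partial v}{\partial y} = \left( \frac{\Phi_1(y)}{\Phi_2(v)} \frac{\partial v}{\partial y} \right)^{-1}$$

for each y as well. Summing these, we get

$$\frac{\Phi_1(y)}{\Phi_2(v)} \frac{\partial v}{\partial y} = \frac{\Phi_2(v)}{\Phi_1(y)} \left( \frac{\partial v}{\partial y} \right)^{-1} .$$

We can then write

$$\frac{\partial v}{\partial y} = \frac{\Phi_2(v)}{\Phi_1(y)} , \qquad \frac{\partial v}{\partial z} = \frac{\Phi_2(v)}{\Phi_2(z)} .$$

Combining these to obtain the differential dv we get

$$\frac{dv}{\Phi_2(v)} = \frac{dy}{\Phi_1(y)} + \frac{dz}{\Phi_2(z)} . \tag{25}$$

By denoting $\int \frac{dp}{\Phi_k(p)} = \ln f_k(p)$, we obtain

$$C f_2(v) = f_1(y) f_2(z) . \tag{26}$$

The function $f_1$ can be incorporated into our model, that is, $f_1 \circ h \mapsto h$. By inserting
$v = F(y, z)$ we get the final form

$$C f_2(F(y, z)) = f_1(y) f_2(z) . \tag{27}$$

□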
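
The result can be checked numerically. The sketch below assumes an update of the form $F(y, z) = f_2^{-1}(f_1(y) f_2(z) / C)$, which is the solved form of the relation above, and compares it against a hypothetical counterexample that is not of this form; the helper names `make_update` and `is_exchangeable` are illustrative only.

```python
import math

def make_update(f1, f2, f2_inv, C=1.0):
    """Build F(y, z) = f2^{-1}(f1(y) * f2(z) / C) from the given functions."""
    return lambda y, z: f2_inv(f1(y) * f2(z) / C)

def is_exchangeable(F, x, y, z, tol=1e-9):
    """Check F(x, F(y, z)) == F(y, F(x, z)) at a single point (x, y, z)."""
    return math.isclose(F(x, F(y, z)), F(y, F(x, z)), rel_tol=tol, abs_tol=tol)

# Plain multiplication: f1 and f2 are identity maps.
product = make_update(lambda y: y, lambda z: z, lambda w: w)

# Addition is multiplication in disguise: f1 = f2 = exp, so F(y, z) = y + z.
additive = make_update(math.exp, math.exp, math.log)

# A hypothetical update rule that cannot be written in the form above.
def counterexample(y, z):
    return y + z * z
```

At the point (0.3, 0.7, 0.2) the first two updates pass the check and the third does not, in line with the claim that exchangeability forces a rule that is multiplicative up to a change of variables.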
A.2 Mapping from $p(c \mid x, \theta)$ to $h(r, \theta)$ is Monotonically Increasing

Proposition A.1 From axiom 5 it follows that

$$\log h(r, \theta) = f_c(\log p(c \mid x, \theta)) \tag{28}$$

where $f_c$ is a monotonically increasing function.

Proof Denoting $\tilde{\theta}_r = p(r \mid \tilde{\theta})$ in inequalities (5) and (6) in the paper, we can write them
in the following form:

$$\sum_{r \in R} \tilde{\theta}_r \log h(r, \theta_1) \le \sum_{r \in R} \tilde{\theta}_r \log h(r, \theta_2) \tag{29}$$

$$\Updownarrow$$

$$\sum_{r \in R} \tilde{\theta}_r \log p(c \mid x, \theta_1) \le \sum_{r \in R} \tilde{\theta}_r \log p(c \mid x, \theta_2) . \tag{30}$$

Consider the points in the parameter space $\bar{\Theta}$ where $\tilde{\theta}_k = 1$ and $\tilde{\theta}_i = 0$ for $i \ne k$
("corner points"). In these points the linear combinations vanish and the equivalent
inequalities (29) and (30) become

$$\log h(r_k, \theta_1) \le \log h(r_k, \theta_2)$$

$$\Updownarrow \tag{31}$$

$$\log p(c_k \mid x_k, \theta_1) \le \log p(c_k \mid x_k, \theta_2) .$$

Since the functional form of $f_c$ must be the same regardless of the choice of $\tilde{\theta}$,
equivalence (31) holds everywhere in the parameter space, not just in the corners.

From the equivalence (31) (and the symmetry of the models with respect to
re-labeling the data items) it follows that $h(r, \theta)$ must be of the form

$$\log h(r, \theta) = f_c(\log p(c \mid x, \theta))$$

where $f_c$ is a monotonically increasing function. □

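A.3 Mapping is of Form $h(r, \theta) = \exp(\beta)\, p(c \mid x, \theta)^A$

Proposition A.2 For a continuous increasing function $f_c(t)$ for which

$$\log h(r, \theta) = f_c(\log p(c \mid x, \theta))$$

it follows from axiom 5 that

$$f_c(t) = At + \beta ,$$

or, equivalently,

$$h(r, \theta) = \exp(\beta)\, p(c \mid x, \theta)^A \quad \text{with } A > 0 . \tag{32}$$

Proof Note that we can decompose $K_{cond}$ as

$$K_{cond}(\tilde{\theta}, \theta) = S(\tilde{\theta}) - R(\tilde{\theta}, \theta) , \tag{33}$$

where $S(\tilde{\theta}) = \sum_{r \in R} p(r \mid \tilde{\theta}) \log p(c \mid x, \tilde{\theta})$ and $R(\tilde{\theta}, \theta) = \sum_{r \in R} p(r \mid \tilde{\theta}) \log p(c \mid x, \theta)$. Consider any
$\tilde{\theta}$ and the set of points $\theta$ that satisfy $R(\tilde{\theta}, \theta) = t$, where t is some constant. From
the fifth axiom (the equality part) it follows that there must exist a constant $f_{\tilde{\theta}}(t)$ that
defines the same set of points $\theta$, defined by $\sum_{r \in R} p(r \mid \tilde{\theta}) \log h(r, \theta) = f_{\tilde{\theta}}(t)$. From
the inequality part of the same axiom it follows that $f_{\tilde{\theta}}$ is a monotonically increasing
function. Hence,

$$\sum_{r \in R} p(r \mid \tilde{\theta}) \log h(r, \theta) = f_{\tilde{\theta}} \left( \sum_{r \in R} p(r \mid \tilde{\theta}) \log p(c \mid x, \theta) \right) . \tag{34}$$

On the other hand, from equation (28) we know that we can write

$$\log h(r, \theta) = f_c(\log p(c \mid x, \theta)) . \tag{35}$$

So, equations (34) and (35) lead to

$$f_{\tilde{\theta}} \left( \sum_{r \in R} p(r \mid \tilde{\theta}) \log p(c \mid x, \theta) \right) = \sum_{r \in R} p(r \mid \tilde{\theta})\, f_c(\log p(c \mid x, \theta)) . \tag{36}$$

If we make a variable change $u_i = \log p(c_i \mid x_i, \theta)$ and denote $p(r_i \mid \tilde{\theta}) = \tilde{\theta}_i$ for
brevity, equation (36) becomes

$$f_{\tilde{\theta}} \left( \sum_i \tilde{\theta}_i u_i \right) = \sum_i \tilde{\theta}_i f_c(u_i) . \tag{37}$$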



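Not all $u_i$ are independent, however: for each fixed x, one of the variables $u_l$ is
determined by the other $u_i$'s, since

$$\exp(u_i) = p(c_i \mid x_i, \theta)$$

and

$$\sum_{\text{fixed } x} p(c_i \mid x, \theta) = 1 \;\Rightarrow\; \sum_{\text{fixed } x} \exp(u_i) = 1 .$$

So the last $u_l$ for each x is

$$u_l = \log \left( 1 - \sum_{\text{fixed } x} \exp(u_m) \right) \tag{38}$$

where the sum only includes the independent variables $u_m$ for the fixed x. This way
we can make the dependency on each $u_i$ explicit in equation (37):

$$f_{\tilde{\theta}} \left( \sum_{\text{indep. } u_j} \tilde{\theta}_j u_j + \sum_{\text{dependent } u_l} \tilde{\theta}_l \log \left( 1 - \sum_{\text{fixed } x} \exp(u_m) \right) \right) = \sum_{\text{indep. } u_j} \tilde{\theta}_j f_c(u_j) + \sum_{\text{dependent } u_l} \tilde{\theta}_l f_c \left( \log \left( 1 - \sum_{\text{fixed } x} \exp(u_m) \right) \right) . \tag{39}$$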
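Let us differentiate both sides with respect to a $u_k$:

$$f'_{\tilde{\theta}} \left( \sum_i \tilde{\theta}_i u_i \right) \left( \tilde{\theta}_k - \frac{\tilde{\theta}_l}{\exp(u_l)} \exp(u_k) \right) = \tilde{\theta}_k f'_c(u_k) - \frac{\tilde{\theta}_l}{\exp(u_l)} f'_c(u_l) \exp(u_k) . \tag{40}$$

For all such variables $u_k$ that share the same x, writing $a = f'_{\tilde{\theta}}(\sum_i \tilde{\theta}_i u_i)$,
$c_x = \tilde{\theta}_l / \exp(u_l)$ and $d_x = f'_c(u_l)$, we get

$$a \left( \tilde{\theta}_k - c_x \exp(u_k) \right) = \tilde{\theta}_k f'_c(u_k) - c_x d_x \exp(u_k)$$

$$\Updownarrow$$

$$\tilde{\theta}_k \exp(-u_k) \left( f'_c(u_k) - a \right) = c_x (d_x - a) .$$

Since the right-hand side only depends on x, not on individual $u_k$, the left-hand
side must also only depend on x and the factors depending on $u_k$ must cancel out.
Denoting the common value by $B_x$,

$$f'_c(u_k) - a = B_x \frac{\exp(u_k)}{\tilde{\theta}_k}
\;\Rightarrow\;
\underbrace{f'_{\tilde{\theta}} \left( \sum_i \tilde{\theta}_i u_i \right)}_{\text{does not depend on } u_k} = \underbrace{f'_c(u_k) - B_x \frac{\exp(u_k)}{\tilde{\theta}_k}}_{\text{depends on } u_k} .$$

Since the left-hand side depends neither on $u_k$ nor x, both sides must be constant:

$$\Rightarrow\; f'_{\tilde{\theta}}(t) = A . \tag{41}$$

Substituting (41) into equation (37) we get

$$A \left( \sum_i \tilde{\theta}_i u_i \right) + \beta = \sum_i \tilde{\theta}_i f_c(u_i)$$

$$\Updownarrow$$

$$\sum_i \tilde{\theta}_i \left( A u_i - f_c(u_i) \right) = -\beta ,$$

and since this must hold for any parameters $\tilde{\theta}$, it must also hold for the corner points:

$$A u_i - f_c(u_i) = -\beta \;\Rightarrow\; f_c(t) = At + \beta . \tag{42}$$

□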
A.4 Axiom 6 Implies Exponent A = 1

Proposition A.3 From axiom 6 it follows that A = 1.

Proof Without axiom 6, the discriminative posterior would be unique up to a positive
constant A:

$$p_d(\theta \mid D) \propto p(\theta) \prod_{(c, x) \in D} p(c \mid x, \theta)^A . \tag{43}$$

Axiom 6 is used to fix this constant to unity by requiring the discriminative posterior
to obey the Bayesian convention for a fixed x, that is, the discriminative modeling
should reduce to Bayesian joint modeling when there is only one covariate. We require
that for a fixed x and a data set $D_x = \{(c, x') \in D \mid x' = x\}$ the discriminative
posterior matches the joint posterior $p_x$ of a model $p_x$, where $p_x(c \mid \theta) = p(c \mid x, \theta)$,

$$p_x(\theta \mid D_x) \propto p(\theta) \prod_{(c, x) \in D_x} p_x(c \mid \theta) = p(\theta) \prod_{(c, x) \in D_x} p(c \mid x, \theta) . \tag{44}$$

Clearly, A = 1 satisfies axiom 6, i.e., $p_x(\theta \mid D_x)$ equals $p_d(\theta \mid D_x)$ for all $\theta \in \Theta$.
If the proposition were false, axiom 6 would have to be satisfied for some $A \ne 1$ and for
all $\theta$ and data sets. In particular, the result should hold for a data set having a single
element, (c, x). The discriminative posterior would in this case read

$$p_d(\theta \mid D_x) = \frac{1}{Z_A}\, p(\theta)\, p(c \mid x, \theta)^A , \tag{45}$$

where $D_x = \{(c, x)\}$ and $Z_A$ is a normalization factor, chosen so that the posterior
satisfies $\int p_d(\theta \mid D_x)\, d\theta = 1$. The joint posterior would, on the other hand, read

$$p_x(\theta \mid D_x) = \frac{1}{Z_1}\, p(\theta)\, p(c \mid x, \theta) . \tag{46}$$

These two posteriors should be equal for all $\theta$:

$$1 = \frac{p_d(\theta \mid D_x)}{p_x(\theta \mid D_x)} = \frac{Z_1}{Z_A}\, p(c \mid x, \theta)^{A - 1} . \tag{47}$$

Because the normalization factors $Z_1$ and $Z_A$ in equation (47) are constant in $\theta$, also
$p(c \mid x, \theta)^{A - 1}$ must be constant in $\theta$. This is possible only if A = 1 or $p(c \mid x, \theta)$ is a
trivial function (constant in $\theta$) for all c.

□
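
The argument can be illustrated on a grid. The sketch below assumes a hypothetical one-parameter model with $p(c = 1 \mid x, \theta) = \theta$, a uniform prior over a discrete grid of $\theta$ values, and a single observation c = 1; the name `tempered_posterior` is illustrative only.

```python
import numpy as np

def tempered_posterior(prior, likelihood, A):
    """Normalised p(theta) * p(c | x, theta)**A on a discrete grid of theta values."""
    unnorm = prior * likelihood ** A
    return unnorm / unnorm.sum()

theta_grid = np.linspace(0.01, 0.99, 99)
prior = np.full_like(theta_grid, 1.0 / theta_grid.size)

informative = theta_grid                   # p(c = 1 | x, theta) = theta
trivial = np.full_like(theta_grid, 0.5)    # conditional that is constant in theta

differs = not np.allclose(tempered_posterior(prior, informative, 2.0),
                          tempered_posterior(prior, informative, 1.0))
matches = np.allclose(tempered_posterior(prior, trivial, 2.0),
                      tempered_posterior(prior, trivial, 1.0))
```

Here `differs` and `matches` are both true: an exponent other than one changes the posterior unless the conditional does not depend on the parameter, which is the trivial case excluded in the proof.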